ABSTRACT

This study develops a procedure for detecting items
which are biased for particular ethnic groups and utilizes this
procedure to evaluate the fairness of reading, mathematics, and
occupational information test items for several ethnic groups. The
population for each ethnic group was chosen from examinees
administered the 1973 version of the Florida Eighth Grade Testing
Program (FEGTP). In this study, an item was considered biased if it
manifested an Item X Group interaction. Few biased items were
detected on the Reading, Mathematics, and Occupational Information
tests of the FEGTP. (Author/RC)

ABSTRACT

The purpose of this study was to develop a procedure for detecting
items which are biased for particular ethnic groups and to utilize
this procedure to evaluate the fairness of reading, mathematics, and
occupational information test items for several ethnic groups.

The population for each ethnic group was chosen from examinees
administered the 1973 version of the Florida Eighth Grade Testing
Program (FEGTP).

In this study, an item was considered biased if it manifested
an Item X Group interaction. Few biased items were detected on the
Reading, Mathematics, and Occupational Information tests of the
FEGTP.

AN INVESTIGATION OF THE FAIRNESS OF
THE ITEMS OF A TEST BATTERY¹

Ronald L. Fishbein

Michigan Department of Education

This study focused on assessing the fairness of the items of
several tests when a predicted criterion variable was unavailable.
A procedure was developed for detecting items which are unfair or
biased for particular ethnic groups, and this procedure was utilized
to evaluate the fairness of reading, mathematics, and occupational
information test items for several ethnic groups.

The definition of bias used in this study should not necessarily
be equated with the term "cultural bias." For the purposes of this
study, an item was considered biased against a group compared with
another group if the item manifested an Item X Group interaction.
The statistical procedure employed to detect interaction identified
items where a group's mean on an item was higher or lower than
another group's mean on the item by an amount higher or lower than
would be expected from a comparison of both groups' total test
performance (see Cleary & Hilton, 1968).

The bias of the Reading (vocabulary and comprehension), Mathematics
(computation and problem solving), and Occupational Information
test items of the 1973 Florida Eighth Grade Testing Program
(FEGTP) was assessed in this study. The FEGTP is a basic skills
test battery administered annually to virtually every eighth grade
student in the state of Florida. The three tests evaluated were

The groups considered in this study were: (1) White Caucasians,
(2) Black Afro-Americans, (3) American Indians, (4) Orientals,
(5) Puerto Rican Americans, (6) Cuban Americans, (7) Males,
(8) Females, (9) Urban examinees, and (10) Rural examinees. Examinee
classifications were determined from the information provided by
the examinee on the answer sheet of the test battery under the
categories Race Code and Sex. In addition, an examinee was
classified as urban if he participated in the 1973 test administration
in a county with at least 96.1% of the population defined as urban
for 1970 by the U.S. Department of Commerce, Bureau of the Census,
and as rural if he participated in a county with 0.0% of the
population defined as urban.

Several previous studies have evaluated test fairness without
the use of a predicted criterion variable. These studies have
used analysis of variance (ANOVA) procedures to detect significant
Item X Group interactions. For example, Cardall and Coffman (1964)
assessed the fairness of the items of the Scholastic Aptitude Test
(SAT) for Rural, Urban, and Black examinees using a two-factor
ANOVA design with repeated measures on items. The significant
Item X Group interactions indicated that some items of the SAT
may have had different relative difficulties for the groups
examined. Similar investigations (Angoff & Sharon, 1974; Cleary &
Hilton, 1968) have also detected significant Item X Group
interactions.

Angoff and Sharon (1974) noted that a major limitation of
using ANOVA to detect Item X Group interaction is the failure
to detect the specific items contributing to the interaction.
They attempted to overcome this shortcoming by producing a
bivariate plot of item difficulty values for each pair of groups
being compared and calculating the perpendicular distance of each
point from the major axis of the elliptical plot of item points.
However, Angoff and Sharon did not attempt to specify how deviant
an item had to be before it should be labelled as biased. The
technique used in this study detected significant Item X Group
interactions at the level of the individual item.
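
The perpendicular-distance idea attributed above to Angoff and Sharon can be illustrated with a short Python sketch. It assumes raw item p values for two groups, takes the major axis to be the principal axis of the scatter of item points, and returns each item's signed distance from that axis. The function name `distances_from_major_axis` and the two lists of p values are hypothetical, and the sketch is not a reconstruction of Angoff and Sharon's own computations.

```python
from math import atan2, cos, sin


def distances_from_major_axis(x, y):
    """Signed perpendicular distance of each point (x[i], y[i]) from the
    principal (major) axis of the bivariate scatter."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Angle of the major axis of the covariance ellipse.
    theta = 0.5 * atan2(2 * sxy, sxx - syy)
    # Project each centered point onto the unit normal of the major axis.
    return [-(a - mx) * sin(theta) + (b - my) * cos(theta)
            for a, b in zip(x, y)]


# Hypothetical p values: items 1-4 differ by .10 between groups,
# item 5 reverses the direction of the difference.
group_a = [0.30, 0.50, 0.70, 0.90, 0.60]
group_b = [0.20, 0.40, 0.60, 0.80, 0.80]

d = distances_from_major_axis(group_a, group_b)
most_deviant = max(range(len(d)), key=lambda i: abs(d[i]))
print("most deviant item:", most_deviant + 1)
```

As the text notes, distances of this kind rank items by deviance but do not by themselves say how large a distance must be before an item is called biased.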

METHOD

Samples

Samples were chosen from the data of the 1973 administration
of the FEGTP. Five systematic samples of 225 examinees in each
sample were chosen from the population of White examinees; five
systematic samples of 225 examinees in each sample were chosen
from the population of Black examinees; and five systematic
samples of 225 examinees in each sample were chosen from the
population of Cuban American examinees. The samples from the
White, Black, and Cuban American populations were mutually
exclusive. Systematic samples of 225 examinees were also chosen
from the populations of each of the following groups: American
Indians, Puerto Rican Americans, Males, Females, Urban examinees,
and Rural examinees. In addition, a systematic sample of 224
examinees was chosen from the population of Oriental examinees.³
All samples from each population were chosen without replacement.
Since the item responses of the examinees who participated in the
testing program were grouped by county, and by school within each
county, the systematic samples chosen were, for practical purposes,
equivalent to stratified samples where the number of examinees
chosen from any school represented approximately the proportion
of examinees from that school in the population of the group sampled.
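
The sampling argument above can be sketched in Python. The sketch assumes a fixed sampling interval and consecutive starting points, details the paper does not specify; the function `systematic_samples` and the three-school roster are hypothetical.

```python
from collections import Counter


def systematic_samples(records, size, count=1):
    """Draw `count` mutually exclusive systematic samples of `size` records
    from an ordered list, taking every `step`-th record from a different
    starting position for each sample."""
    step = len(records) // size
    if step < count:
        raise ValueError("population too small for the requested samples")
    return [[records[start + j * step] for j in range(size)]
            for start in range(count)]


# Hypothetical roster already sorted by school: 600, 300, and 100 examinees.
roster = ["A"] * 600 + ["B"] * 300 + ["C"] * 100
(sample,) = systematic_samples(roster, 100)
print(Counter(sample))
```

Because the roster is ordered by school, each school contributes to the sample roughly in proportion to its share of the population, which is the sense in which the text treats the systematic samples as equivalent to stratified ones.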

Procedure

In this study the term bias was always used in a comparative
sense. An item was biased against a group compared with another
group. For the purpose of determining whether test items are biased
against certain ethnic groups, an item was considered biased
against a group compared with another group if the item manifested
an Item X Group interaction. Stated differently, an item was
considered biased if the difference in performance on the item
for the two groups was significantly different from the difference
in their overall performance on the test. If the difference in
performance on an item was significantly less than the overall
difference in performance between the two samples of a comparison,
then the item was considered biased against the group having the
higher overall performance. If the difference in performance on
an item was significantly greater than the overall difference in
performance between the two groups of a comparison, then the item
was considered biased against the group having the lower overall
performance. The 29 comparisons made on each test that was assessed
for bias are shown in Table 1.

The following procedure was employed to determine whether the
reading, mathematics, and occupational information test items of
the FEGTP were biased against certain ethnic groups.

The population p value on the ith test for the jth group on
the kth item was set equal to $P_{ijk}$, and the average population p
value on the ith test for the jth group was set equal to $P_{ij.}$. Then,
for example, to test the fairness of the third item on the second
test for groups one and six, the following statistical hypotheses
were formed, where $\Delta = P_{21.} - P_{26.}$:

$$H_0: P_{213} - P_{263} = \Delta \qquad (1)$$

$$H_1: P_{213} - P_{263} \neq \Delta \qquad (2)$$

The null hypothesis was tested by forming the following confidence
interval, where $p_{ijk}$ equaled the p value on the ith test for the jth
group on the kth item, and $n_{ij}$ equaled the number of examinees from
the jth group who had taken the ith test (see Marascuilo, 1971).
If the confidence interval did not include $\Delta$, then the null
hypothesis was rejected, with the probability of a Type I error equal
to $\alpha$.
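
As a rough illustration of the test described above, the following Python sketch assumes the usual large-sample interval for a difference between two independent proportions, (p1 - p2) ± z * sqrt(p1(1 - p1)/n1 + p2(1 - p2)/n2), with a two-sided critical value z for the chosen error rate. The interval actually used in the study is the one referenced to Marascuilo (1971) and may differ in detail. The function name `biased_against`, its arguments, and the p values in the example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def biased_against(p1, p2, n1, n2, mean1, mean2, alpha=0.0002):
    """Return the label of the group an item is flagged as biased against,
    or None if the interval for the item difference includes Delta.

    p1, p2       sample p values on the item for group 1 and group 2
    n1, n2       sample sizes
    mean1, mean2 average p values on the whole test for each group
    """
    # Orient the pair so that "hi" is the group with the higher test mean.
    hi = ("group 1", p1, n1, mean1)
    lo = ("group 2", p2, n2, mean2)
    if mean2 > mean1:
        hi, lo = lo, hi
    delta = hi[3] - lo[3]              # overall difference (Delta)
    diff = hi[1] - lo[1]               # difference on this item
    # Assumed interval: normal approximation, unpooled standard error.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(hi[1] * (1 - hi[1]) / hi[2] + lo[1] * (1 - lo[1]) / lo[2])
    lower, upper = diff - z * se, diff + z * se
    if upper < delta:                  # item gap smaller than overall gap
        return hi[0]
    if lower > delta:                  # item gap larger than overall gap
        return lo[0]
    return None


# Hypothetical comparison: group 1 averages .70 on the test, group 2 .60.
print(biased_against(0.50, 0.75, 225, 225, 0.70, 0.60))  # item reverses gap
print(biased_against(0.70, 0.60, 225, 225, 0.70, 0.60))  # item matches gap
print(biased_against(0.90, 0.50, 225, 225, 0.70, 0.60))  # item widens gap
```

The direction rule follows the definition given under Procedure: an item gap significantly smaller than the overall gap counts against the higher-scoring group, and a significantly larger gap counts against the lower-scoring group.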

Each of the three tests assessed for bias was considered an
experiment. For each item of each test that was assessed for bias,
the 29 comparisons listed in Table 1 were made with $\alpha = .0002$ for
each comparison. With $\alpha$ equal to .0002 for a comparison, the Type I
error rate for an item was approximately .0058. Therefore, the
test $\alpha$ for reading was approximately .348; the test $\alpha$ for
mathematics was approximately .452; and the test $\alpha$ for occupational
information was approximately .232. If one used an alternative
hypothesis that the least significant difference of interest for
Equation 2 was .25, then the power of each statistical comparison
was approximately .94 (see Marascuilo, 1971, p. 301). This
calculation assumed a maximum standard error of the difference
between the two proportions. Cohen (1969) has defined a difference
between two independent proportions of approximately .25 as a
medium effect size.
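
The error rates and the power figure quoted above can be checked with a few lines of Python. The sketch makes several assumptions that are not stated in the text: error rates are combined additively (a Bonferroni-type approximation), the number of items per test is inferred by dividing the comparison totals reported in the Results (1,740, 2,262, and 1,160) by 29, each sample contains 225 examinees, and power is computed from a two-sided normal approximation with both p values at .50, which gives the maximum standard error.

```python
from math import sqrt
from statistics import NormalDist

ALPHA_COMPARISON = 0.0002   # per-comparison error rate stated in the text
N_COMPARISONS = 29          # comparisons made on each item (Table 1)

# Additive approximation to the Type I error rate for one item.
alpha_item = N_COMPARISONS * ALPHA_COMPARISON

# Items per test, inferred from the comparison totals in the Results.
items = {
    "Reading": 1740 // N_COMPARISONS,
    "Mathematics": 2262 // N_COMPARISONS,
    "Occupational Information": 1160 // N_COMPARISONS,
}
alpha_test = {name: k * alpha_item for name, k in items.items()}

# Approximate power to detect a departure of .25 from Delta.
n = 225
se_max = sqrt(0.25 / n + 0.25 / n)            # both p values equal to .50
z_crit = NormalDist().inv_cdf(1 - ALPHA_COMPARISON / 2)
power = NormalDist().cdf(0.25 / se_max - z_crit)

for name, a in alpha_test.items():
    print(f"{name}: {items[name]} items, test alpha about {a:.3f}")
print(f"power about {power:.2f}")
```

Under these assumptions the computed values agree with the figures reported in the text to the precision given there.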

RESULTS

The major finding of this study was that there were few biased
items on the Reading, Mathematics, and Occupational Information
tests of the FEGTP when an item was defined as biased if it
manifested an Item X Group interaction. The percentage of biased
comparisons on the Reading test was 3.68; the percentage of biased
comparisons on the Mathematics test was 1.89; and the percentage of
biased comparisons on the Occupational Information test was 1.58.
However, it should be pointed out that 7 out of 1,740 comparisons on
the Reading test, 86 out of 2,262 comparisons on the Mathematics
test, and 19 out of 1,160 comparisons on the Occupational Information
test were eliminated from consideration because of a ceiling effect:
the p values for both groups were high enough that it was impossible,
or inconceivable, for the group with the larger test mean to outscore
the comparison group by a value as large as $\Delta$. Also, 19
comparisons were eliminated from the Reading test because of a floor
effect: the p values for both groups were below, at, or slightly
above the chance level.

Table 2 indicates the percentage of biased items on the Reading
test for each comparison of the study. The Cuban-Indian comparison,
which contained the highest percentage of biased items on the Reading
test, had an equal number of items biased against Cubans as against
American Indians. On the White-Cuban comparison there were 11
instances of bias against Cubans and 9 instances of bias against
Whites. However, all nine instances of bias against Whites on the
White-Cuban comparison were vocabulary items which resembled the
Spanish translation and, therefore, gave an unusual advantage to
Cuban American examinees. On the Black-Cuban comparison there were
7 instances of bias against Cubans and 13 instances of bias against
Blacks. On the White-Black comparison there were seven instances of
bias against Blacks and six instances of bias against Whites. There
was no evidence of bias on the remaining Reading test comparisons.

Table 3 indicates the percentage of biased items on the Mathematics
test for each comparison of the study. The White-Indian
comparison, which had the highest percentage of biased items on the
Mathematics test, contained 11 items biased against American Indians
and no items biased against Whites. On the Oriental-Indian comparison
there was one instance of bias against Orientals and four instances
of bias against American Indians. On the Black-Oriental comparison
there were three instances of bias against Blacks and no instances
of bias against Orientals. There was no evidence of bias on the
remaining Mathematics test comparisons.

Table 4 indicates the percentage of biased items on the Occupational
Information test for each comparison of the study. The
only comparisons which showed bias on the Occupational Information
test were White-Black, Black-Oriental, Black-Cuban, and Male-Female.
The Black-Cuban comparison showed the highest percentage of biased
items on the Occupational Information test. There were five instances
of bias against Blacks and four instances of bias against Cubans.

Specifically, the following generalizations concerning the
Reading, Mathematics, and Occupational Information tests seem
warranted. There was a greater tendency for reading vocabulary items
to exhibit bias than reading comprehension items. This was true
even after taking into account that the Vocabulary subtest was
twice as long as the Comprehension subtest. This tendency was
especially pronounced for Blacks, where all 20 instances of bias
against Blacks on the Reading test were vocabulary items. The
researcher was unable to explain this unexpected occurrence.

There was virtually no evidence of bias on Male-Female and
Urban-Rural comparisons. There were also few instances of bias
against Oriental and Puerto Rican examinees. When bias was detected,
items were most often biased against Whites, Blacks, Cubans, and
American Indians. Bias against Blacks, Cubans, and Indians was
expected, but bias against Whites was unforeseen. However, items
biased against Whites were often detected on White-Cuban comparisons
and, as mentioned previously, could be explained because the biased
item was a vocabulary word which resembled the Spanish translation.

There was a tendency for a relatively higher percentage of
comparisons to show bias on the Problem Solving section of the
Mathematics test than on the Computation section. This result was
consistent with expectations, unlike the finding that a higher
percentage of reading vocabulary items was biased than reading
comprehension items.

DISCUSSION AND RECOMMENDATIONS

This study failed to detect a substantial degree of Item X
Group interaction for the items of the Reading, Mathematics, and
Occupational Information tests of the FEGTP. Since the test items
which were assessed for bias are representative of basic skills
test items given below the college level, and to the extent that
Florida students are representative of the nation, similar results
might be obtained with other achievement batteries in other areas
of the country. Assuming that similar results would be obtained,
what would be demonstrated? The logic of statistical inference
does not permit one to prove a null hypothesis, and one would only
be justified in saying that since the hypothesis of no Item X Group
interaction was usually not rejected, one can continue to entertain
the hypothesis that there is little interaction. Even if the
researcher could have proved the null hypothesis for every comparison
of this study, item fairness would not have been demonstrated. A
finding of no Item X Group interaction means that a test item is
functioning in a homogeneous manner in terms of the relative
difficulty for the groups of a comparison. However, the possibility
exists that all of the items of a test may be biased against a
particular group, but none of the items would display interaction
because they are all biased in a similar manner. It is also possible
that an item detected as biased was actually fair, but was labeled
biased because most of the other items on the test were biased.
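
The limitation described above can be made concrete with a small numerical sketch in Python. The p values are invented for illustration: every item is made harder for the second group by the same amount, so each item difference equals the overall difference and no item shows any interaction, even though every item carries the same handicap.

```python
# Hypothetical p values for a four-item test.
group_a = [0.90, 0.80, 0.70, 0.60]
shift = 0.20                                   # same handicap on every item
group_b = [round(p - shift, 2) for p in group_a]

# Overall difference between the groups (Delta in the notation of the text).
delta = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)

# Departure of each item difference from the overall difference.
interaction = [(a - b) - delta for a, b in zip(group_a, group_b)]
flagged = [i + 1 for i, v in enumerate(interaction) if abs(v) > 1e-9]

print("overall difference:", round(delta, 2))
print("items showing interaction:", flagged)
```

Conversely, if three of the four items carried the handicap and one did not, the one fair item would be the item whose difference departs from the overall difference.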

The reading and mathematics test items assessed in this study
were developed by a major commercial testing company and the
occupational information test items were developed by the staff of the
FEGTP. These items had undergone considerable editing and all three
tests had been administered to representative Florida samples before
the final versions were printed. Even with these safeguards, possible
flaws were found in many of the biased items. Once weaknesses were
detected, logical revisions seemed possible.

It would be extremely advantageous if biased items were
detected during the test development stage. This would be an
especially important consideration if the degree of bias on other
test batteries were found to be greater than that of the FEGTP.

Several of the items which manifested bias were examined by
a Black and several native Spanish-speaking graduate students.
They were able to propose logical explanations for the behavior
of many biased items. The present writer, a White male, was unable
to detect many of these weaknesses. This would indicate that test
fairness would probably be improved if members of minority groups
would edit items on standardized tests. It also seems logical that
test fairness would be improved if minority groups were included
on committees which determine the objectives and content to be tested.
This seems particularly important in the development of criterion
referenced tests.

In conclusion, it should be emphasized that a question as
controversial as the fairness of psychological test items cannot be
resolved by psychometric debate. When minority group is no longer
synonymous with lower scoring group, the issue of test bias will
cease to exist.

FOOTNOTES

¹This paper is based upon the author's Ph.D. dissertation submitted
to the faculty of the Educational Evaluation and Research Design Program,
The Florida State University. Helpful suggestions were made by Jacob G.
Beard (major professor), Harman D. Burck, Garrett R. Foster, John R. Hills,
Howard W. Stoker, and Gerald J. Schluck.

²The computer programs for this study were written by Dr. Philippe
Olivier and Mrs. Marjorie Olivier.

³The sample of Oriental examinees was 224 because of a programming
error.

TABLE 1

Reading, Mathematics, and Occupational Information
Item Comparisons

Comparison Number    Comparison

        1            White1 vs. Black1
        2            White2 vs. Black2
        3            White3 vs. Black3
        4            White4 vs. Black4
        5            White5 vs. Black5
        6            White1 vs. Oriental
        7            White1 vs. Cuban American1
        8            White2 vs. Cuban American2
        9            White3 vs. Cuban American3
       10            White4 vs. Cuban American4
       11            White5 vs. Cuban American5
       12            White1 vs. American Indian
       13            White1 vs. Puerto Rican American
       14            Black1 vs. Oriental
       15            Black1 vs. Cuban American1
       16            Black2 vs. Cuban American2
       17            Black3 vs. Cuban American3
       18            Black4 vs. Cuban American4
       19            Black5 vs. Cuban American5
       20            Black1 vs. American Indian
       21            Black1 vs. Puerto Rican American
       22            Oriental vs. Cuban American1
       23            Oriental vs. American Indian
       24            Oriental vs. Puerto Rican American
       25            Cuban American1 vs. American Indian
       26            Cuban American1 vs. Puerto Rican American
       27            American Indian vs. Puerto Rican American
       28            Rural vs. Urban
       29            Male vs. Female

TABLE 2

Percentage of Biased Comparisons Among Ethnic Groups
Reading Total

Comparison                 Percentage Biased

White-Blackᵃ
White-Oriental                    0.00
White-Cubanᵇ                      6.67
White-Indian                      0.00
White-Puerto Rican                0.00
Black-Oriental                    1.69
Black-Cubanᶜ                      6.94
Black-Indian                      1.67
Black-Puerto Rican                0.00
Oriental-Cuban                    3.33
Oriental-Indian                   0.00
Oriental-Puerto Rican             0.00
Cuban-Indian                     10.00
Cuban-Puerto Rican                1.67
Indian-Puerto Rican               0.00
Rural-Urban                       0.00
Male-Female                       0.00

ᵃBased upon five White-Black comparisons.
ᵇBased upon five White-Cuban comparisons.
ᶜBased upon five Black-Cuban comparisons.

TABLE 3

Percentage of Biased Comparisons Among Ethnic Groups
Mathematics Total

Comparison                 Percentage Biased

White-Blackᵃ                      3.00
White-Oriental                    0.00
White-Cubanᵇ                      0.26
White-Indian                     11.84
White-Puerto Rican                0.00
Black-Oriental                    4.17
Black-Cubanᶜ                      1.88
Black-Indian                      2.60
Black-Puerto Rican                0.00
Oriental-Cuban                    0.00
Oriental-Indian                   7.04
Oriental-Puerto Rican             0.00
Cuban-Indian                      2.74
Cuban-Puerto Rican                0.00
Indian-Puerto Rican               1.28
Rural-Urban                       0.00
Male-Female                       0.00

ᵃBased upon five White-Black comparisons.
ᵇBased upon five White-Cuban comparisons.
ᶜBased upon five Black-Cuban comparisons.

TABLE 4

Percentage of Biased Comparisons Among Ethnic Groups
Occupational Information

Comparison                 Percentage Biased

White-Blackᵃ                      3.65
White-Oriental                    0.00
White-Cubanᵇ                      0.00
White-Indian                      0.00
White-Puerto Rican                0.00
Black-Oriental                    2.56
Black-Cubanᶜ                      4.66
Black-Indian                      0.00
Black-Puerto Rican                0.00
Oriental-Cuban                    0.00
Oriental-Indian                   0.00
Oriental-Puerto Rican             0.00
Cuban-Indian                      0.00
Cuban-Puerto Rican                0.00
Indian-Puerto Rican               0.00
Rural-Urban                       0.00
Male-Female                       2.50

ᵃBased upon five White-Black comparisons.
ᵇBased upon five White-Cuban comparisons.
ᶜBased upon five Black-Cuban comparisons.